Predicting U.N. World Happiness

Load U.N. World Happiness Data

Explore descriptive statistics and bivariate relationships

Summary statistics of average feature values by happiness categories
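The grouped summary can be sketched with a `groupby` on the happiness category. The column names and values below are stand-ins for illustration; the real notebook loads one row per country from the U.N. World Happiness data:

```python
import pandas as pd

# Toy stand-in for the U.N. World Happiness data (column names are assumptions;
# the real file has one row per country)
df = pd.DataFrame({
    "gdp_per_capita":  [1.4, 1.2, 0.9, 0.6, 0.3],
    "social_support":  [1.5, 1.3, 1.1, 0.9, 0.6],
    "life_expectancy": [0.9, 0.8, 0.7, 0.5, 0.3],
    "happiness_level": ["Very High", "High", "Average", "Low", "Very Low"],
})

# Average feature values within each happiness category
summary = df.groupby("happiness_level").mean(numeric_only=True)
print(summary)
```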

The plot above shows a strong positive relationship between happiness level and GDP, social support, and healthy life expectancy: as these values increase, happiness level also increases. The same trend exists for freedom to make life choices, though the relationship is not as strong. Generosity and perceptions of corruption are relatively similar across the High, Average, and Low happiness levels, with slight increases in these categories for Very High and Very Low happiness; the correlation between these two variables and happiness level is weaker than for the other variables.

This boxplot supports the observation above: generosity varies very little across happiness levels.

This boxplot also supports the trend in the initial plot: mean social support is highest for the Very High happiness level and lowest for the Very Low level, decreasing steadily across the categories in between.

This bar plot shows the number of observations in each happiness level. The categories are relatively evenly distributed.
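A category-count bar plot like this can be produced with `value_counts` and pandas plotting. The labels below are invented for illustration; the real notebook counts one label per country:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Toy happiness labels; the real data has one per country
levels = pd.Series(["Very High", "High", "Average", "Low", "Very Low",
                    "High", "Average", "Low"])
counts = levels.value_counts()

ax = counts.plot(kind="bar")
ax.set_xlabel("Happiness level")
ax.set_ylabel("Number of observations")
plt.tight_layout()
print(counts)
```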

Lastly, I include a correlation matrix. A lighter shade represents a higher correlation between two variables while a darker shade indicates a lower correlation. The most important row to analyze is the final one, showing the correlation between happiness level and each covariate. Happiness level is most correlated with GDP per capita and life expectancy, both at 0.79, followed by social support at 0.74. As identified earlier, generosity has the lowest correlation at 0.028.
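A correlation heatmap of this kind can be sketched with `DataFrame.corr` and matplotlib. The columns and values here are synthetic, chosen only so that one pair is strongly correlated:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
# Synthetic columns: happiness_score is built to track gdp_per_capita closely
df = pd.DataFrame({"gdp_per_capita": rng.normal(size=50)})
df["happiness_score"] = 0.8 * df["gdp_per_capita"] + 0.2 * rng.normal(size=50)
df["generosity"] = rng.normal(size=50)

corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="viridis", vmin=-1, vmax=1)  # lighter = higher correlation
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
print(corr)
```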

Examining Feature Importance

The outcome variable, happiness level, is a five-class outcome variable. To examine feature importance, I run a random forest classifier to get an idea of which features hold the highest Gini importance. Consistent with the trends discovered in the bivariate analyses above, GDP per capita, social support, and life expectancy hold the highest Gini importance values.
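Gini importances come directly from the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data in which the outcome is driven mostly by the GDP column, so that column should rank highest:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
# Synthetic features (names mirror the happiness data but values are made up)
X = pd.DataFrame({
    "gdp_per_capita": rng.normal(size=n),
    "social_support": rng.normal(size=n),
    "generosity":     rng.normal(size=n),
})
# Five-class outcome driven mostly by GDP, weakly by social support
score = X["gdp_per_capita"] + 0.3 * X["social_support"] + 0.1 * rng.normal(size=n)
y = pd.cut(score, bins=5,
           labels=["Very Low", "Low", "Average", "High", "Very High"])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = (pd.Series(rf.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
print(importances)
```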

Another way to understand feature importance is with Principal Component Analysis.

This initial plot below shows that nearly 80% of the variance in the data is explained by the first two principal components, and nearly 90% with the first four.
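A cumulative explained-variance curve can be reproduced with scikit-learn's `PCA`. The data below is synthetic, with highly correlated columns so that the first component dominates; the real notebook fits PCA on the happiness features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Six correlated columns: all are noisy copies of one latent factor,
# so the first principal component should capture most of the variance
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(6)])

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)
```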

This plot below shows a 2D PCA scatter plot of the first two components. The point color corresponds to the outcome variable, happiness level, with 5 (yellow) being Very High and 1 (purple) being Very Low.
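A scatter plot of this kind projects the data onto the first two components and colors points by the class label. The features and labels below are random placeholders for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))            # placeholder feature matrix
levels = rng.integers(1, 6, size=100)    # 1 = Very Low ... 5 = Very High

Z = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
sc = ax.scatter(Z[:, 0], Z[:, 1], c=levels, cmap="viridis")
fig.colorbar(sc, label="Happiness level")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
```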

The plot below is a biplot, showing the importance of each feature via its eigenvector. A higher-magnitude eigenvector suggests a feature with greater importance.

GDP per capita and social support have the largest magnitudes, suggesting they are the most important features.

GDP per capita and Social support are the two most important variables in Principal Component 1.
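The PC1 loadings behind a claim like this come from the rows of `PCA.components_`. The sketch below builds synthetic data in which GDP and social support move together while generosity is independent, so those two should dominate the first component:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n = 200
gdp = rng.normal(size=n)
support = 0.9 * gdp + 0.1 * rng.normal(size=n)  # strongly tied to GDP
generosity = rng.normal(size=n)                 # independent of the others
X = pd.DataFrame({"gdp_per_capita": gdp,
                  "social_support": support,
                  "generosity": generosity})

pca = PCA().fit(X)
# Rows of components_ are the eigenvectors; the absolute PC1 loadings
# show which features dominate the first component
pc1 = (pd.Series(pca.components_[0], index=X.columns)
         .abs().sort_values(ascending=False))
print(pc1)
```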

Build model to predict happiness ranking

Keras

Random Forest

Gradient Boosting

XGBoost

SVM

I predict happiness level with five machine learning algorithms. The best performing model is the Support Vector Machine, with a mean cross-validation F1-score of 0.63 and a test-set F1-score of 0.53. The best SVM parameters are:
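An SVM tuned by cross-validated grid search on a macro F1 score can be sketched as below. The data, parameter grid, and split are placeholders; the notebook's actual grid and scores come from the happiness data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Placeholder five-class problem standing in for the happiness data
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scale features before the SVM, then grid-search C and the kernel
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "linear"]},
                    scoring="f1_macro", cv=5)
grid.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, grid.predict(X_te), average="macro")
print(grid.best_params_, grid.best_score_, test_f1)
```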


Failed Trials

Below, I include code used to try (unsuccessfully) to improve the best performing SVM model. First, I try modeling with only the most important features, then I try adding additional outside data to the original UN Happiness Index dataset.

What if we only try including the most important features?

The SVM model does not perform nearly as well with only a subset of features, and the best parameters selected are different. While features like generosity and perceptions of corruption are not the most important features in the dataset, they are still useful for predicting happiness levels!

What if we add more data?

Additional data from OECD Better Life Index

While this additional dataset is comprehensive, one major problem is that it only covers about one quarter of the countries in the original dataset, so there are many missing observations. Since I still want to add this dataset to the original for the many extra fields it provides, I try to overcome the missing-data problem by grouping countries by region (continent) and filling each missing value with the region's average for that feature.
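Region-mean imputation of this kind can be done with a grouped `transform`. The region and column names below are invented for illustration (the OECD fields differ):

```python
import numpy as np
import pandas as pd

# Toy merged frame: the second row in each region is missing the OECD field
# ("work_life_balance" is a hypothetical column name)
merged = pd.DataFrame({
    "region": ["Europe", "Europe", "Asia", "Asia"],
    "work_life_balance": [7.0, np.nan, 5.0, np.nan],
})

# Fill each missing value with the mean of the values present in its region
merged["work_life_balance"] = (
    merged.groupby("region")["work_life_balance"]
          .transform(lambda s: s.fillna(s.mean()))
)
print(merged)
```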

Again, the SVM model does not perform nearly as well as the best performing model. This model performs slightly better than the one with only the three most important features, but it still fails to reach the scores of the best model trained on only the original data. More data (especially data with many missing values) is not necessarily helpful for improving a model's performance.

The plot below shows feature importance with the additional data. Compared to the original features, all of the additional features are much less important on the Gini importance scale.